Prune index for <package>-<version> before updating #61
Conversation
Should we reprocess all packages to delete the duplicates?
If it's not too hard -- yes, and probably only for the special_packages (but I am not sure). Alternatively, the duplicates could be removed with a script that goes through all current special packages' docs and deduplicates each.
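The deduplicate script mentioned above could be sketched roughly like this. The field names (`package`, `ref`, `id`) are assumptions for illustration, not the actual index schema:

```python
# Hypothetical sketch of the cleanup-script idea: group a package's current
# search documents by their logical key and flag every copy after the first,
# so a script could delete those ids. Field names are assumed, not verified.

def find_duplicate_ids(docs):
    """Return the ids of all but the first document sharing the same
    (package, ref) key -- i.e. the copies a cleanup script would delete."""
    seen = set()
    duplicates = []
    for doc in docs:
        key = (doc["package"], doc["ref"])
        if key in seen:
            duplicates.append(doc["id"])
        else:
            seen.add(key)
    return duplicates
```

A script would run this per special package over the docs currently in the index, then issue deletes for the returned ids.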
No need IMO. If there are duplicates, they will be fixed as soon as the maintainer republishes the docs.
For example, once this is merged, the duplicates for Elixir will automatically be fixed once we commit to main/v1.19. So we should be fine.
I also see it can happen for conventional packages, since docs can be republished at any time. Republishing docs doesn't have the same restrictions as republishing packages.
I will reprocess all packages this weekend; it will also be useful for finding out whether we introduced any new regressions in the pipeline.
One thing I noticed about delete operations is that they can be very slow (several minutes) for queries affecting a lot of docs -- like deleting all of a package's docs at once.
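One way to keep such deletes manageable is to fetch the matching ids first and delete them in fixed-size batches rather than with one large delete-by-query. A minimal sketch, where `client.delete_ids` is a hypothetical stand-in for whatever bulk-delete call the search backend actually exposes:

```python
# Sketch: delete a large package's docs in fixed-size batches instead of
# a single big delete-by-query. `client.delete_ids` is a hypothetical
# stand-in for the backend's real bulk-delete call.

def batches(ids, size=500):
    """Split a list of document ids into chunks of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def delete_package_docs(client, ids, batch_size=500):
    for chunk in batches(ids, batch_size):
        client.delete_ids(chunk)  # assumed bulk-delete API
```

Batch size is a tuning knob; smaller batches keep each request fast at the cost of more round trips.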
Maybe we should do it for elixir-main explicitly then? Because it will probably take forever with all the duplicates!
Yes, I have already deleted it (though it has been re-indexed with fewer or no duplicates since) -- that's how I found out it was slow. These were the counts a few days ago (pre-delete):

```sql
select package,
       count(*),
       count(*) / (select count(*) from 'documents-export-hexdocs-prod-10-21-2025--7-38-08-PM.jsonl') * 100 as '%'
from 'documents-export-hexdocs-prod-10-21-2025--7-38-08-PM.jsonl'
group by package
order by 2 desc
limit 40;
```
This should help resolve the Elixir `main` and similar duplicates.